Method for accelerating fast Fourier transform based on field programmable gate array

ABSTRACT

A method for accelerating fast Fourier transform (FFT) based on field programmable gate array is provided. A sequence requiring N-point FFT is decomposed equally into 4 subsequences. The 4 subsequences are processed through 4 parallel FFT intellectual property (IP) cores. Finally, an arithmetic operation is performed on the processed data and twiddle factor data pre-stored in a memory to obtain a result of the N-point FFT of an original sequence. An FFT decomposition module, a twiddle factor storage module, and an operation processing module are provided. Through the processing method, a time delay consumed by an N-point FFT operation can be reduced, and excellent application value can be achieved in a high-speed digital signal processing system.

TECHNICAL FIELD

The present disclosure belongs to the field of digital signal processing, and relates to a method for accelerating fast Fourier transform based on field programmable gate array (FPGA).

BACKGROUND ART

Fast Fourier transform (FFT), one of the most basic signal processing and analysis tools in digital signal processing, transforms a signal from the time domain to the frequency domain and is widely used in communications and image processing. However, as data throughput demands continue to increase, the long time delay of the FFT intellectual property (IP) core provided by an FPGA makes it unable to meet application requirements in a high-speed system. Taking a high-speed communication system as an example, in an in-phase/quadrature (IQ) imbalance and channel estimation module, the FFT IP core is required for auxiliary calculation regardless of whether frequency-domain equalization is estimated with the time domain or time-domain equalization is estimated with the frequency domain. When an estimated parameter value is converged iteratively in the high-speed system, the long time delay of the FFT IP core causes calculated data to be overwritten, which in turn degrades the performance of the system.

Therefore, in view of the defects in the conventional art, it is necessary to provide a method for accelerating fast Fourier transform based on FPGA, so as to solve the problem of long time delay in the conventional art.

SUMMARY

The present disclosure provides a method for accelerating fast Fourier transform based on FPGA. An FFT decomposition module, a twiddle factor storage module, and an operation processing module are provided; an FFT is decomposed into four FFT submodules for processing, and four channels of parallel data are processed in a deep pipeline architecture according to four formulas, thereby greatly reducing the time delay of the operation.

To solve problems existing in the conventional art, technical solutions of the present disclosure are as follows:

A method for accelerating fast Fourier transform based on FPGA is provided. The FPGA is internally provided with an FFT decomposition module, a twiddle factor storage module, and an arithmetic operation processing module. The method includes:

step S1: decomposing, by the FFT decomposition module, an input sequence having a length of N into 4 subsequences having a length of N/4, and synchronously performing N/4-point FFT;

step S2: outputting, by the twiddle factor storage module, pre-stored twiddle factor data by determining an external input signal for performing arithmetic operation; and

step S3: processing, by the arithmetic operation processing module, data output by the FFT decomposition module and the twiddle factor storage module to obtain a result of N-point FFT of an original sequence;

where, the FFT decomposition module includes a sequence decomposition module and four parallel N/4-point FFT submodules, where the N/4-point FFT submodules of the FFT decomposition module are FFT intellectual property (IP) cores provided by the FPGA, and an output mode of the FFT IP cores of the FFT decomposition module is sequential output; the twiddle factor storage module includes two control units respectively marked as ctr1 and ctr2, and three independent single-port block random access memories (BRAMs) respectively configured to store three different types of twiddle factor data; the three independent single-port BRAMs included in the twiddle factor storage module respectively have a depth of N/4 and are respectively marked as BRAM1, BRAM2, and BRAM3; and corresponding twiddle factor data stored in the three independent single-port BRAMs are as follows:

$\begin{matrix} {WN1 = e^{- j\frac{2\pi}{N}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1;} & (1) \end{matrix}$ $\begin{matrix} {WN2 = e^{- j\frac{2\pi}{N}k},\quad k = \frac{N}{4}, \frac{N}{4} + 1, \ldots, \frac{N}{2} - 1;} & (2) \end{matrix}$ and $\begin{matrix} {WN3 = e^{- j\frac{2\pi}{N/2}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1;} & (3) \end{matrix}$

where, the step S1 further includes:

substep S11: upon determining that a signal fft_valid==1, decomposing the input sequence x(n) having the length of N into the 4 subsequences, x(4m), x(4m+2), x(4m+1), and x(4m+3), each having the length of N/4, where n=0, 1, . . . , N−1, and m=0, 1, . . . , N/4−1;

substep S12: transforming the 4 subsequences x(4m), x(4m+2), x(4m+1), and x(4m+3) through four parallel N/4-point FFT IP cores respectively, to obtain 4 output sequences z₁(m), z₂(m), z₃(m), and z₄(m), where m=0, 1, . . . , N/4−1; z₁(m) corresponds to an output of x(4m), z₂(m) corresponds to an output of x(4m+2), z₃(m) corresponds to an output of x(4m+1), and z₄(m) corresponds to an output of x(4m+3); and

substep S13: outputting synchronously a data valid signal data1_valid, which is valid at a high level, to the twiddle factor storage module while outputting the sequences z₁(m), z₂(m), z₃(m), and z₄(m).

The step S2 further includes:

substep S21: sequentially storing twiddle factors shown in formula (1), formula (2), and formula (3) into three independent memory cells BRAM1, BRAM2, and BRAM3 in a binary form according to an increasing order of k values;

substep S22: when the control unit ctr1 determines that the signal data1_valid==1, generating a read enable signal rd1 and a read address counter addr1 of BRAM3, sequentially acquiring the twiddle factor data of WN3, and transmitting the twiddle factor data of WN3 to the operation processing module; and

substep S23: when the control unit ctr2 determines that a signal data2_valid==1, generating a read enable signal rd2 and a read address counter addr2 of BRAM1 and BRAM2, sequentially acquiring the twiddle factor data of WN1 and WN2, and transmitting the twiddle factor data of WN1 and WN2 to the operation processing module.

The step S3 further includes:

substep S31: performing a delay beating on the sequences z₂(m) and z₄(m) to wait for the twiddle factor data of WN3; and multiplying the sequences z₂(m) and z₄(m) by the twiddle factor data of WN3 in sequence in a pipeline architecture to obtain sequences z₂₁(m) and z₄₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1;

substep S32: performing the delay beating on the sequences z₁(m) and z₃(m) to wait for z₂₁(m) and z₄₁(m); adding z₁(m) and z₂₁(m) in sequence in the pipeline architecture to obtain a sequence Z₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1; performing a subtraction operation on z₁(m) and z₂₁(m) in sequence in the pipeline architecture, by subtracting z₂₁(m) from z₁(m) to obtain a sequence Z₂(m) having a length of N/4, where m=0, 1, . . . , N/4−1; adding z₃(m) and z₄₁(m) in sequence in the pipeline architecture to obtain a sequence Z₃(m) having a length of N/4, where m=0, 1, . . . , N/4−1; and performing a subtraction operation on z₃(m) and z₄₁(m) in sequence in the pipeline architecture, by subtracting z₄₁(m) from z₃(m) to obtain a sequence Z₄(m) having a length of N/4, where m=0, 1, . . . , N/4−1;

substep S33: outputting synchronously a data valid signal data2_valid, which is valid at a high level, to the twiddle factor storage module while obtaining Z₃(m) and Z₄(m);

substep S34: performing the delay beating on the sequences Z₃(m) and Z₄(m) to wait for the twiddle factor data of WN1 and the twiddle factor data of WN2; multiplying the sequence Z₃(m) and the twiddle factor data of WN1 in sequence in the pipeline architecture to obtain a sequence Z₃₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1; and multiplying the sequence Z₄(m) and the twiddle factor data of WN2 in sequence in the pipeline architecture to obtain a sequence Z₄₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1;

substep S35: performing the delay beating on the sequences Z₁(m) and Z₂(m) to wait for Z₃₁(m) and Z₄₁(m); adding the sequences Z₁(m) and Z₃₁(m) in sequence in the pipeline architecture to obtain a sequence y₁(n) having a length of N/4, where n=0, 1, . . . , N/4−1; adding the sequences Z₂(m) and Z₄₁(m) in sequence in the pipeline architecture to obtain a sequence y₂(n) having a length of N/4, where n=0, 1, . . . , N/4−1; performing a subtraction operation on the sequences Z₁(m) and Z₃₁(m) in sequence in the pipeline architecture, by subtracting Z₃₁(m) from Z₁(m) to obtain a sequence y₃(n) having a length of N/4, where n=0, 1, . . . , N/4−1; and performing a subtraction operation on the sequences Z₂(m) and Z₄₁(m) in sequence in the pipeline architecture, by subtracting Z₄₁(m) from Z₂(m) to obtain a sequence y₄(n) having a length of N/4, where n=0, 1, . . . , N/4−1; and

substep S36: outputting synchronously a data valid signal dout_valid, which is valid at a high level, while outputting y₁(m), y₂(m), y₃(m), and y₄(m).

Compared with the conventional art, the present disclosure has the following technical effects:

The FFT decomposition module, the twiddle factor storage module, and the operation processing module are provided, the FFT is decomposed into four FFT submodules for processing, and four channels of parallel data are processed in a deep pipeline architecture according to four formulas, thereby greatly reducing the time delay of the operation.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an overall schematic diagram of a method for accelerating fast Fourier transform based on FPGA according to the present disclosure;

FIG. 2 is a schematic diagram of an FFT decomposition module according to the present disclosure;

FIG. 3 is a schematic diagram of a twiddle factor storage module according to the present disclosure; and

FIG. 4 is a schematic diagram of an operation processing module according to the present disclosure.

The present disclosure is further described below with reference to the following specific embodiments and the above accompanying drawings.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Specific implementations of the present disclosure are described below in detail with reference to the accompanying drawings:

FIG. 1 is an overall schematic diagram of a method for accelerating fast Fourier transform based on FPGA according to the present disclosure. An FFT decomposition module, a twiddle factor storage module, and an operation processing module are provided. FIG. 2 is a schematic diagram of the FFT decomposition module according to the present disclosure. The FFT decomposition module includes a sequence decomposition module and four parallel N/4-point FFT submodules. The N/4-point FFT submodules of the FFT decomposition module are FFT intellectual property (IP) cores provided by the FPGA, and an output mode of the FFT IP cores of the FFT decomposition module is sequential output. FIG. 3 is a schematic diagram of the twiddle factor storage module according to the present disclosure. The twiddle factor storage module includes two control units respectively marked as ctr1 and ctr2, and three independent single-port block random access memories (BRAMs) respectively configured to store three different types of twiddle factor data; the three independent single-port BRAMs included in the twiddle factor storage module each have a depth of N/4 and are respectively marked as BRAM1, BRAM2, and BRAM3. Corresponding twiddle factor data stored in the three independent single-port BRAMs are as follows:

$\begin{matrix} {WN1 = e^{- j\frac{2\pi}{N}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1;} & (1) \end{matrix}$ $\begin{matrix} {WN2 = e^{- j\frac{2\pi}{N}k},\quad k = \frac{N}{4}, \frac{N}{4} + 1, \ldots, \frac{N}{2} - 1;} & (2) \end{matrix}$ and $\begin{matrix} {WN3 = e^{- j\frac{2\pi}{N/2}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1.} & (3) \end{matrix}$
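As an illustration of formulas (1) to (3), the following minimal Python sketch computes the three twiddle factor tables in floating point. The function name `twiddle_tables` and the choice N = 16 are assumptions made for the example only; the disclosure stores the values in BRAMs in binary form, which this sketch does not model.

```python
import cmath


def twiddle_tables(n):
    """Return (WN1, WN2, WN3) as lists of complex values per formulas (1)-(3).

    n is the FFT length N and must be a multiple of 4.
    """
    if n % 4:
        raise ValueError("n must be a multiple of 4")
    quarter = n // 4
    # Formula (1): exp(-j*2*pi*k/N), k = 0 .. N/4-1
    wn1 = [cmath.exp(-2j * cmath.pi * k / n) for k in range(quarter)]
    # Formula (2): exp(-j*2*pi*k/N), k = N/4 .. N/2-1
    wn2 = [cmath.exp(-2j * cmath.pi * k / n) for k in range(quarter, 2 * quarter)]
    # Formula (3): exp(-j*2*pi*k/(N/2)), k = 0 .. N/4-1
    wn3 = [cmath.exp(-2j * cmath.pi * k / (n / 2)) for k in range(quarter)]
    return wn1, wn2, wn3


wn1, wn2, wn3 = twiddle_tables(16)
print(len(wn1), len(wn2), len(wn3))
```

Each table has N/4 entries, matching the stated BRAM depth. Note that every WN2 entry equals the corresponding WN1 entry multiplied by −j, and every WN3 entry is the square of the corresponding WN1 entry.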

The specific implementation of the present disclosure includes steps S1-S3.

Step S1: the FFT decomposition module decomposes an input sequence having a length of N into 4 subsequences having a length of N/4, and N/4-point FFT is synchronously performed.

Step S2: the twiddle factor storage module outputs pre-stored twiddle factor data by determining an external input signal for performing arithmetic operation.

Step S3: the arithmetic operation processing module processes data output by the FFT decomposition module and the twiddle factor storage module to obtain a result of N-point FFT of an original sequence.

FIG. 2 is a schematic diagram of the FFT decomposition module according to the present disclosure. The step S1 further includes substeps S11-S13.

Substep S11: upon determining that a signal fft_valid==1, the input sequence x(n) having the length of N is decomposed into the 4 subsequences, x(4m), x(4m+2), x(4m+1), and x(4m+3), each having the length of N/4, where n=0, 1, . . . , N−1, and m=0, 1, . . . , N/4−1.

Substep S12: the 4 subsequences x(4m), x(4m+2), x(4m+1), and x(4m+3) are transformed through four parallel N/4-point FFT IP cores to obtain 4 output sequences z₁(m), z₂(m), z₃(m), and z₄(m), where m=0, 1, . . . , N/4−1; z₁(m) corresponds to an output of x(4m), z₂(m) corresponds to an output of x(4m+2), z₃(m) corresponds to an output of x(4m+1), and z₄(m) corresponds to an output of x(4m+3).

Substep S13: a data valid signal data1_valid, which is valid at a high level, is output synchronously to the twiddle factor storage module while the sequences z₁(m), z₂(m), z₃(m), and z₄(m) are output.
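The decomposition of substeps S11 and S12 can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not the hardware implementation: the four subsequences are taken to be the stride-4 sample sets x(4m), x(4m+2), x(4m+1), and x(4m+3), and a naive discrete Fourier transform (the hypothetical helper `dft`) stands in for the N/4-point FFT IP cores. The name `decompose` is likewise illustrative.

```python
import cmath


def dft(seq):
    """Naive O(n^2) DFT, used here only as a stand-in for an FFT IP core."""
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def decompose(x):
    """Split x into four stride-4 subsequences in the order
    x(4m), x(4m+2), x(4m+1), x(4m+3)."""
    if len(x) % 4:
        raise ValueError("length must be a multiple of 4")
    return x[0::4], x[2::4], x[1::4], x[3::4]


x = list(range(16))
subsequences = decompose(x)
# In hardware the four transforms run in parallel; here they run in turn.
z1, z2, z3, z4 = (dft(s) for s in subsequences)
print(subsequences)
```

For the 16-sample ramp above, the first subsequence is [0, 4, 8, 12] and the second is [2, 6, 10, 14]; the ordering matters because z₂(m) and z₄(m) are the branches later multiplied by WN3.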

FIG. 3 is a schematic diagram of the twiddle factor storage module according to the present disclosure. The step S2 further includes substeps S21-S23.

Substep S21: twiddle factors shown in formula (1), formula (2), and formula (3) are sequentially stored into three independent memory cells BRAM1, BRAM2, and BRAM3 in a binary form according to an increasing order of k values.

Substep S22: when the control unit ctr1 determines that the signal data1_valid==1, a read enable signal rd1 and a read address counter addr1 of BRAM3 are generated, twiddle factor data of WN3 are acquired sequentially and transmitted to the operation processing module.

Substep S23: when the control unit ctr2 determines that a signal data2_valid==1, a read enable signal rd2 and a read address counter addr2 of BRAM1 and BRAM2 are generated, twiddle factor data of WN1 and WN2 are acquired sequentially and transmitted to the operation processing module.
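Substeps S21 to S23 can be illustrated with the following software model. Several details are assumptions, since the disclosure does not specify them: a signed 16-bit Q1.15 fixed-point format with saturation, a 32-bit word with the real part in the upper half and the imaginary part in the lower half, and a controller that advances its address only on cycles where the valid signal is high. The names `to_q15`, `bram_words`, and `read_controller` are hypothetical.

```python
import cmath

FRAC_BITS = 15  # assumed Q1.15 format; the word width is not given in the source


def to_q15(value):
    """Quantize a real value in [-1, 1] to a saturated signed 16-bit integer."""
    scaled = round(value * (1 << FRAC_BITS))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, scaled))


def bram_words(k_start, k_stop, period):
    """Binary words for exp(-j*2*pi*k/period), stored in increasing order of k.

    Assumed layout: real part in bits 31..16, imaginary part in bits 15..0.
    """
    words = []
    for k in range(k_start, k_stop):
        w = cmath.exp(-2j * cmath.pi * k / period)
        re = to_q15(w.real) & 0xFFFF  # two's-complement encoding
        im = to_q15(w.imag) & 0xFFFF
        words.append(format((re << 16) | im, "032b"))
    return words


def read_controller(valid_stream, depth):
    """Model of ctr1 or ctr2: per cycle, yield (rd, addr).

    rd is asserted while valid is high and fewer than `depth` words were read.
    """
    addr = 0
    for valid in valid_stream:
        rd = bool(valid) and addr < depth
        yield rd, (addr if rd else None)
        if rd:
            addr += 1


n = 16
bram1 = bram_words(0, n // 4, n)            # WN1, formula (1)
bram2 = bram_words(n // 4, n // 2, n)       # WN2, formula (2)
bram3 = bram_words(0, n // 4, n // 2)       # WN3, formula (3)
print(bram1[0], bram2[0])
print(list(read_controller([0, 1, 1, 0, 1], depth=n // 4)))
```

In this model the value 1.0 saturates to the largest positive Q1.15 code, which is one common way to handle the k = 0 entry; a real design might instead choose a format with more headroom.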

FIG. 4 is a schematic diagram of the operation processing module according to the present disclosure. In the step S3, the result of the N-point FFT is obtained by performing operation processing according to formula (4), formula (5), formula (6), and formula (7), where y₁(m) corresponds, in sequence, to points 1 to N/4 of the result of the N-point FFT, y₂(m) corresponds to points N/4+1 to N/2, y₃(m) corresponds to points N/2+1 to 3N/4, and y₄(m) corresponds to points 3N/4+1 to N:

$\begin{matrix} {y_{1}(m) = z_{1}(m) + e^{- j\frac{2\pi}{N/2}m}z_{2}(m) + e^{- j\frac{2\pi}{N}m}\left( z_{3}(m) + e^{- j\frac{2\pi}{N/2}m}z_{4}(m) \right),\quad m = 0, 1, \ldots, \frac{N}{4} - 1;} & (4) \end{matrix}$ $\begin{matrix} {y_{2}(m) = z_{1}(m) - e^{- j\frac{2\pi}{N/2}m}z_{2}(m) + e^{- j\frac{2\pi}{N}\left( m + \frac{N}{4} \right)}\left( z_{3}(m) - e^{- j\frac{2\pi}{N/2}m}z_{4}(m) \right),\quad m = 0, 1, \ldots, \frac{N}{4} - 1;} & (5) \end{matrix}$ $\begin{matrix} {y_{3}(m) = z_{1}(m) + e^{- j\frac{2\pi}{N/2}m}z_{2}(m) - e^{- j\frac{2\pi}{N}m}\left( z_{3}(m) + e^{- j\frac{2\pi}{N/2}m}z_{4}(m) \right),\quad m = 0, 1, \ldots, \frac{N}{4} - 1;} & (6) \end{matrix}$ $\begin{matrix} {y_{4}(m) = z_{1}(m) - e^{- j\frac{2\pi}{N/2}m}z_{2}(m) - e^{- j\frac{2\pi}{N}\left( m + \frac{N}{4} \right)}\left( z_{3}(m) - e^{- j\frac{2\pi}{N/2}m}z_{4}(m) \right),\quad m = 0, 1, \ldots, \frac{N}{4} - 1.} & (7) \end{matrix}$
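As a brief check on where formulas (4) to (7) come from, the following sketch derives them from the definition of the N-point discrete Fourier transform X(k). It assumes that the four subsequences are the stride-4 sample sets x(4m), x(4m+2), x(4m+1), and x(4m+3), with z₁ to z₄ as their N/4-point transforms, and it introduces the shorthand $W_{N} = e^{- j\frac{2\pi}{N}}$, which is not used elsewhere in this disclosure.

$\begin{matrix} {X(k) = \sum_{n = 0}^{N - 1} x(n) W_{N}^{nk} = \sum_{r = 0}^{3} W_{N}^{rk} \sum_{m = 0}^{N/4 - 1} x(4m + r) W_{N/4}^{mk},\quad k = 0, 1, \ldots, N - 1.} \end{matrix}$

Each inner sum is an N/4-point transform and is therefore periodic in k with period N/4. Writing k = m + qN/4 with q = 0, 1, 2, 3, and using $W_{N}^{2k} = W_{N/2}^{k}$ and $W_{N/2}^{m + qN/4} = ( - 1)^{q} W_{N/2}^{m}$, gives

$\begin{matrix} {X\left( m + q\frac{N}{4} \right) = z_{1}(m) + ( - 1)^{q} W_{N/2}^{m} z_{2}(m) + W_{N}^{m + qN/4}\left( z_{3}(m) + ( - 1)^{q} W_{N/2}^{m} z_{4}(m) \right).} \end{matrix}$

Setting q = 0 and q = 1 reproduces formulas (4) and (5) directly. For q = 2 and q = 3, substituting $W_{N}^{m + N/2} = - W_{N}^{m}$ and $W_{N}^{m + 3N/4} = - W_{N}^{m + N/4}$ yields formulas (6) and (7).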

The step S3 further includes substeps S31-S36.

Substep S31: a delay beating (that is, a register delay of a number of clock cycles) is performed on the sequences z₂(m) and z₄(m) to wait for the twiddle factor data of WN3; and the sequences z₂(m) and z₄(m) are multiplied by the twiddle factor data of WN3 in sequence in a pipeline architecture to obtain sequences z₂₁(m) and z₄₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1.

Substep S32: the delay beating is performed on the sequences z₁(m) and z₃(m) to wait for z₂₁(m) and z₄₁(m); z₁(m) and z₂₁(m) are added in sequence in the pipeline architecture to obtain a sequence Z₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1; a subtraction operation is performed on z₁(m) and z₂₁(m) in sequence in the pipeline architecture, i.e., z₂₁(m) is subtracted from z₁(m) to obtain a sequence Z₂(m) having a length of N/4, where m=0, 1, . . . , N/4−1; z₃(m) and z₄₁(m) are added in sequence in the pipeline architecture to obtain a sequence Z₃(m) having a length of N/4, where m=0, 1, . . . , N/4−1; and a subtraction operation is performed on z₃(m) and z₄₁(m) in sequence in the pipeline architecture, i.e., z₄₁(m) is subtracted from z₃(m) to obtain a sequence Z₄(m) having a length of N/4, where m=0, 1, . . . , N/4−1.

Substep S33: a data valid signal data2_valid, which is valid at a high level, is output synchronously to the twiddle factor storage module while Z₃(m) and Z₄(m) are obtained.

Substep S34: the delay beating is performed on the sequences Z₃(m) and Z₄(m) to wait for the twiddle factor data of WN1 and the twiddle factor data of WN2; the sequence Z₃(m) and the twiddle factor data of WN1 are multiplied in sequence in the pipeline architecture to obtain a sequence Z₃₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1; and the sequence Z₄(m) and the twiddle factor data of WN2 are multiplied in sequence in the pipeline architecture to obtain a sequence Z₄₁(m) having a length of N/4, where m=0, 1, . . . , N/4−1.

Substep S35: the delay beating is performed on the sequences Z₁(m) and Z₂(m) to wait for Z₃₁(m) and Z₄₁(m); the sequences Z₁(m) and Z₃₁(m) are added in sequence in the pipeline architecture to obtain a sequence y₁(n) having a length of N/4, where n=0, 1, . . . , N/4−1; the sequences Z₂(m) and Z₄₁(m) are added in sequence in the pipeline architecture to obtain a sequence y₂(n) having a length of N/4, where n=0, 1, . . . , N/4−1; a subtraction operation is performed on the sequences Z₁(m) and Z₃₁(m) in sequence in the pipeline architecture, i.e., Z₃₁(m) is subtracted from Z₁(m) to obtain a sequence y₃(n) having a length of N/4, where n=0, 1, . . . , N/4−1; and a subtraction operation is performed on the sequences Z₂(m) and Z₄₁(m) in sequence in the pipeline architecture, i.e., Z₄₁(m) is subtracted from Z₂(m) to obtain a sequence y₄(n) having a length of N/4, where n=0, 1, . . . , N/4−1.

Substep S36: a data valid signal dout_valid, which is valid at a high level, is output synchronously while y₁(m), y₂(m), y₃(m), and y₄(m) are output.
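The data flow of substeps S31 to S36 can be sketched in software as follows. This is a behavioural model only, not the pipelined hardware: a naive DFT (the hypothetical helper `dft`) stands in for the FFT IP cores, the delay alignment is implicit because the lists are combined element by element, the valid signals are not modelled, and the stride-4 indexing of the subsequences is an assumption. The function name `fft_by_four` is illustrative.

```python
import cmath


def dft(seq):
    """Naive O(n^2) DFT used as a stand-in for an FFT IP core and as reference."""
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def fft_by_four(x):
    """N-point transform built from four N/4-point transforms (S1, S31-S35)."""
    n = len(x)
    if n % 4:
        raise ValueError("length must be a multiple of 4")
    q = n // 4
    # Step S1: four sub-transforms of the stride-4 subsequences
    z1, z2, z3, z4 = dft(x[0::4]), dft(x[2::4]), dft(x[1::4]), dft(x[3::4])
    # Twiddle factors of formulas (1)-(3)
    wn1 = [cmath.exp(-2j * cmath.pi * k / n) for k in range(q)]
    wn2 = [cmath.exp(-2j * cmath.pi * k / n) for k in range(q, 2 * q)]
    wn3 = [cmath.exp(-2j * cmath.pi * k / (n / 2)) for k in range(q)]
    # S31: multiply z2 and z4 by WN3
    z21 = [z2[m] * wn3[m] for m in range(q)]
    z41 = [z4[m] * wn3[m] for m in range(q)]
    # S32: first layer of additions and subtractions
    Z1 = [z1[m] + z21[m] for m in range(q)]
    Z2 = [z1[m] - z21[m] for m in range(q)]
    Z3 = [z3[m] + z41[m] for m in range(q)]
    Z4 = [z3[m] - z41[m] for m in range(q)]
    # S34: multiply Z3 by WN1 and Z4 by WN2
    Z31 = [Z3[m] * wn1[m] for m in range(q)]
    Z41 = [Z4[m] * wn2[m] for m in range(q)]
    # S35: second layer of additions and subtractions
    y1 = [Z1[m] + Z31[m] for m in range(q)]
    y2 = [Z2[m] + Z41[m] for m in range(q)]
    y3 = [Z1[m] - Z31[m] for m in range(q)]
    y4 = [Z2[m] - Z41[m] for m in range(q)]
    # y1..y4 are the four consecutive quarters of the N-point result
    return y1 + y2 + y3 + y4


x = [complex(t % 5, -(t % 3)) for t in range(16)]
max_error = max(abs(a - b) for a, b in zip(fft_by_four(x), dft(x)))
print(max_error < 1e-9)
```

Comparing against the direct N-point DFT confirms that concatenating y₁, y₂, y₃, and y₄ gives the full result in natural order, which is consistent with the output ordering stated for formulas (4) to (7).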

The above are the specific implementation steps explained by the inventor in conjunction with the examples, and the present disclosure is applicable to a digital baseband module and a digital image processing module of an FPGA-based communication system. It should be noted that those skilled in the art may improve and refine the method without departing from the present disclosure; the above examples do not restrict the protection scope of the present disclosure, and any improvement or refinement based on the present disclosure shall fall within the protection scope of the present disclosure.

What is claimed is:
 1. A method for accelerating fast Fourier transform based on field-programmable gate array (FPGA), wherein the FPGA is internally provided with a fast Fourier transform (FFT) decomposition module, a twiddle factor storage module, and an arithmetic operation processing module; and the method comprises: step S1: decomposing, by the FFT decomposition module, an input sequence having a length of N into 4 subsequences having a length of N/4, and synchronously performing N/4-point FFT; step S2: outputting, by the twiddle factor storage module, pre-stored twiddle factor data by determining an external input signal for performing arithmetic operation; and step S3: processing, by the arithmetic operation processing module, data output by the FFT decomposition module and the twiddle factor storage module to obtain a result of N-point FFT of an original sequence; wherein the FFT decomposition module comprises a sequence decomposition module and four parallel N/4-point FFT submodules, wherein the N/4-point FFT submodules are FFT intellectual property (IP) cores provided by the FPGA, with an output mode of sequential output; the twiddle factor storage module comprises two control units respectively marked as ctr1 and ctr2, and three independent single-port block random access memories (BRAMs) respectively configured to store three different types of twiddle factor data; the three independent single-port BRAMs respectively have a depth of N/4 and are respectively marked as BRAM1, BRAM2, and BRAM3; and corresponding twiddle factor data stored in the three independent single-port BRAMs are as follows: $\begin{matrix} {WN1 = e^{- j\frac{2\pi}{N}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1;} & (1) \end{matrix}$ $\begin{matrix} {WN2 = e^{- j\frac{2\pi}{N}k},\quad k = \frac{N}{4}, \frac{N}{4} + 1, \ldots, \frac{N}{2} - 1;} & (2) \end{matrix}$ and $\begin{matrix} {WN3 = e^{- j\frac{2\pi}{N/2}k},\quad k = 0, 1, \ldots, \frac{N}{4} - 1;} & (3) \end{matrix}$ wherein the step S1 comprises: substep S11: upon determining that a signal fft_valid==1, decomposing the input sequence x(n) having the length of N into the 4 subsequences, x(4m), x(4m+2), x(4m+1), and x(4m+3), each having the length of N/4, wherein n=0, 1, . . . , N−1, and m=0, 1, . . . , N/4−1; substep S12: transforming the 4 subsequences x(4m), x(4m+2), x(4m+1), and x(4m+3) through four parallel N/4-point FFT IP cores respectively, to obtain 4 output sequences z₁(m), z₂(m), z₃(m), and z₄(m), wherein m=0, 1, . . . , N/4−1; z₁(m) corresponds to an output of x(4m), z₂(m) corresponds to an output of x(4m+2), z₃(m) corresponds to an output of x(4m+1), and z₄(m) corresponds to an output of x(4m+3); and substep S13: outputting synchronously a data valid signal data1_valid to the twiddle factor storage module while outputting the sequences z₁(m), z₂(m), z₃(m), and z₄(m).
 2. The method for accelerating the fast Fourier transform based on the FPGA according to claim 1, wherein the step S2 further comprises: substep S21: sequentially storing corresponding twiddle factors into three independent memory cells BRAM1, BRAM2, and BRAM3 in a binary form according to an increasing order of k values; substep S22: when the control unit ctr1 determines that the signal data1_valid==1, generating a read enable signal rd1 and a read address counter addr1 of BRAM3, sequentially acquiring the twiddle factor data of WN3, and transmitting the twiddle factor data of WN3 to the operation processing module; and substep S23: when the control unit ctr2 determines that a signal data2_valid==1, generating a read enable signal rd2 and a read address counter addr2 of BRAM1 and BRAM2, sequentially acquiring the twiddle factor data of WN1 and WN2, and transmitting the twiddle factor data of WN1 and WN2 to the operation processing module.
 3. The method for accelerating the fast Fourier transform based on the FPGA according to claim 1, wherein step S3 further comprises: substep S31: performing a delay beating on the sequences z₂(m) and z₄(m) to wait for the twiddle factor data of WN3; and multiplying the sequences z₂(m) and z₄(m) by the twiddle factor data of WN3 in sequence in a pipeline architecture to obtain sequences z₂₁(m) and z₄₁(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; substep S32: performing the delay beating on the sequences z₁(m) and z₃(m) to wait for z₂₁(m) and z₄₁(m); adding z₁(m) and z₂₁(m) in sequence in the pipeline architecture to obtain a sequence Z₁(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; performing a subtraction operation on z₁(m) and z₂₁(m) in sequence in the pipeline architecture, by subtracting z₂₁(m) from z₁(m) to obtain a sequence Z₂(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; adding z₃(m) and z₄₁(m) in sequence in the pipeline architecture to obtain a sequence Z₃(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; and performing a subtraction operation on z₃(m) and z₄₁(m) in sequence in the pipeline architecture, by subtracting z₄₁(m) from z₃(m) to obtain a sequence Z₄(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; substep S33: outputting synchronously a data valid signal data2_valid to the twiddle factor storage module while obtaining Z₃(m) and Z₄(m); substep S34: performing the delay beating on the sequences Z₃(m) and Z₄(m) to wait for the twiddle factor data of WN1 and the twiddle factor data of WN2; multiplying the sequence Z₃(m) and the twiddle factor data of WN1 in sequence in the pipeline architecture to obtain a sequence Z₃₁(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; and multiplying the sequence Z₄(m) and the twiddle factor data of WN2 in sequence in the pipeline architecture to obtain a sequence Z₄₁(m) having a length of N/4, wherein m=0, 1, . . . , N/4−1; substep S35: performing the delay beating on the sequences Z₁(m) and Z₂(m) to wait for Z₃₁(m) and Z₄₁(m); adding the sequences Z₁(m) and Z₃₁(m) in sequence in the pipeline architecture to obtain a sequence y₁(n) having a length of N/4, wherein n=0, 1, . . . , N/4−1; adding the sequences Z₂(m) and Z₄₁(m) in sequence in the pipeline architecture to obtain a sequence y₂(n) having a length of N/4, wherein n=0, 1, . . . , N/4−1; performing a subtraction operation on the sequences Z₁(m) and Z₃₁(m) in sequence in the pipeline architecture, by subtracting Z₃₁(m) from Z₁(m) to obtain a sequence y₃(n) having a length of N/4, wherein n=0, 1, . . . , N/4−1; and performing a subtraction operation on the sequences Z₂(m) and Z₄₁(m) in sequence in the pipeline architecture, by subtracting Z₄₁(m) from Z₂(m) to obtain a sequence y₄(n) having a length of N/4, wherein n=0, 1, . . . , N/4−1; and substep S36: outputting synchronously a data valid signal dout_valid while outputting y₁(m), y₂(m), y₃(m), and y₄(m).